IDS Lab Project: DENGUE CASES ANALYSIS IN PAKISTAN

Group Members:

Anousha Gul

Muhammad Ahmed

Subject: Lab Project

Instructor: Sir Jawad

1 — Import Libraries

Start with the core imports used throughout the workflow.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, confusion_matrix, classification_report
)


sns.set(style="whitegrid")

2 — Load Dataset

In [2]:
df = pd.read_csv('DENGUE__Pakistan.csv')
print("Dataset loaded. Shape:", df.shape)
df.head()
Dataset loaded. Shape: (10240, 11)
Out[2]:
date year month province district suspected_cases confirmed_cases deaths temperature rainfall humidity
0 1/1/2016 2016 1.0 Punjab Lahore 5.0 1.0 0.0 24.6 1.4 40.8
1 1/1/2016 2016 1.0 Punjab Rawalpindi 2.0 0.0 0.0 23.4 1.6 43.5
2 1/1/2016 2016 1.0 Punjab Multan 4.0 1.0 0.0 25.2 2.7 43.2
3 1/1/2016 2016 1.0 Sindh Karachi 4.0 3.0 0.0 27.1 0.1 NaN
4 1/1/2016 NaN 1.0 Sindh Hyderabad 10.0 6.0 0.0 26.4 4.0 40.3
In [3]:
print(df.columns)
Index(['date', 'year', 'month', 'province', 'district', 'suspected_cases',
       'confirmed_cases', 'deaths', 'temperature', 'rainfall', 'humidity'],
      dtype='object')

Fix date column (the ##### display problem): parse it to datetime and drop rows whose date cannot be parsed

In [4]:
df['date'] = pd.to_datetime(df['date'], errors='coerce') 
df = df.dropna(subset=['date'])

Convert ALL columns

Step 1: Convert numeric columns to numeric type and fill missing values

In [5]:
numeric_columns = ['suspected_cases', 'confirmed_cases', 'deaths', 
                   'temperature', 'rainfall', 'humidity', 'year', 'month']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce') 
    df[col] = df[col].fillna(df[col].median())  
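The loop above fills missing `year` and `month` values with the column median. For calendar fields this can disagree with the row's own date: row 4 of the preview has the date 1/1/2016 but a missing year. A minimal sketch of an alternative, assuming month/day/year dates and using a small made-up frame (`toy` is hypothetical, not the project data), fills those gaps from the parsed date instead. The results reported later in this notebook were produced with the median fill, not with this variant.

```python
import pandas as pd

# Toy frame mimicking the raw data: the second row has a date but no year/month.
toy = pd.DataFrame({
    "date": ["1/1/2016", "3/15/2019"],
    "year": [2016.0, float("nan")],
    "month": [1.0, float("nan")],
})
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Fill calendar gaps from the parsed date rather than from the column median.
toy["year"] = toy["year"].fillna(toy["date"].dt.year).astype(int)
toy["month"] = toy["month"].fillna(toy["date"].dt.month).astype(int)

print(toy)
```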

Step 2: Clean text columns (like 'province', 'district')

In [6]:
text_columns = ['province', 'district']
for col in text_columns:
    df[col] = df[col].astype(str).str.strip()
    df[col] = df[col].replace({'nan': None, 'NaN': None})

Step 3: Drop rows that still have missing values after cleaning

In [7]:
df = df.dropna()
In [8]:
df.isnull().sum()
Out[8]:
date               0
year               0
month              0
province           0
district           0
suspected_cases    0
confirmed_cases    0
deaths             0
temperature        0
rainfall           0
humidity           0
dtype: int64

Fix column names (clean)

In [9]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df['province'] = df['province'].str.lower().str.capitalize()
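Note that `str.capitalize()` upper-cases only the first character of the whole string, which is why "Khyber Pakhtunkhwa" appears as "Khyber pakhtunkhwa" in later outputs. The sketch below, on a made-up series, contrasts it with `str.title()`, which capitalises every word; swapping it in is a suggestion, not what the notebook ran.

```python
import pandas as pd

# Hypothetical raw province labels with inconsistent case and stray spaces.
provinces = pd.Series(["khyber pakhtunkhwa", "PUNJAB", " sindh "])
cleaned = provinces.str.strip()

capitalized = cleaned.str.capitalize().tolist()  # first letter of the string only
titled = cleaned.str.title().tolist()            # first letter of every word

print(capitalized)
print(titled)
```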

Save cleaned dataset

In [10]:
df.to_csv("cleaned2_dengue_data.csv", index=False)
print("Cleaned dataset saved. Shape:", df.shape)
Cleaned dataset saved. Shape: (10237, 11)

HOW MANY TIMES EACH VALUE APPEARS IN EACH COLUMN

In [11]:
print("Suspected Cases Frequency:\n", df['suspected_cases'].value_counts())
Suspected Cases Frequency:
 suspected_cases
4.0     1483
5.0     1475
6.0     1245
3.0     1139
7.0      989
8.0      766
2.0      761
9.0      552
10.0     396
1.0      388
11.0     301
12.0     191
13.0     134
14.0     106
0.0       72
15.0      72
16.0      33
18.0      20
17.0      19
19.0      15
20.0      15
21.0       9
33.0       4
22.0       4
28.0       4
43.0       4
27.0       3
46.0       3
26.0       3
25.0       2
53.0       2
42.0       2
60.0       2
59.0       2
37.0       2
40.0       2
31.0       1
63.0       1
48.0       1
23.0       1
49.0       1
74.0       1
45.0       1
35.0       1
34.0       1
29.0       1
89.0       1
72.0       1
66.0       1
80.0       1
83.0       1
92.0       1
44.0       1
Name: count, dtype: int64
In [12]:
print("Confirmed Cases Frequency:\n", df['confirmed_cases'].value_counts())
Confirmed Cases Frequency:
 confirmed_cases
2.0     2215
1.0     1982
3.0     1941
4.0     1346
0.0      881
5.0      808
6.0      513
7.0      240
8.0      132
9.0       62
10.0      26
11.0      19
12.0      12
13.0       7
22.0       5
16.0       4
15.0       4
31.0       4
29.0       4
27.0       4
17.0       3
28.0       2
20.0       2
19.0       2
14.0       2
47.0       2
24.0       1
41.0       1
40.0       1
35.0       1
44.0       1
52.0       1
33.0       1
60.0       1
38.0       1
45.0       1
26.0       1
18.0       1
56.0       1
55.0       1
36.0       1
Name: count, dtype: int64
In [13]:
print("Deaths Frequency:\n", df['deaths'].value_counts())
Deaths Frequency:
 deaths
0.0    9951
1.0     273
2.0      11
3.0       2
Name: count, dtype: int64
In [14]:
print("Temperature Frequency:\n", df['temperature'].value_counts())
Temperature Frequency:
 temperature
22.5    92
23.8    84
20.8    83
22.9    80
23.0    79
        ..
7.8      1
9.6      1
36.2     1
36.4     1
8.8      1
Name: count, Length: 279, dtype: int64
In [15]:
print("Rainfall Frequency:\n", df['rainfall'].value_counts())
Rainfall Frequency:
 rainfall
0.8      226
0.7      218
0.9      214
0.6      213
1.3      210
        ... 
306.5      1
83.6       1
84.2       1
61.8       1
82.0       1
Name: count, Length: 976, dtype: int64
In [16]:
print(df.columns)
Index(['date', 'year', 'month', 'province', 'district', 'suspected_cases',
       'confirmed_cases', 'deaths', 'temperature', 'rainfall', 'humidity'],
      dtype='object')
In [17]:
df.iloc[20:35, : ]
Out[17]:
date year month province district suspected_cases confirmed_cases deaths temperature rainfall humidity
20 2016-01-03 2016.0 1.0 Balochistan Quetta 8.0 5.0 0.0 16.9 1.3 48.0
21 2016-01-04 2016.0 1.0 Punjab Lahore 8.0 4.0 0.0 22.5 4.9 41.6
22 2016-01-04 2016.0 1.0 Punjab Rawalpindi 7.0 5.0 0.0 21.5 2.5 60.8
23 2016-01-04 2016.0 1.0 Punjab Multan 5.0 3.0 0.0 26.2 1.8 37.3
24 2016-01-04 2016.0 1.0 Sindh Karachi 4.0 4.0 0.0 27.6 4.8 49.0
25 2016-01-04 2016.0 1.0 Sindh Hyderabad 0.0 0.0 0.0 27.8 0.7 38.9
26 2016-01-04 2016.0 1.0 Khyber pakhtunkhwa Peshawar 1.0 1.0 0.0 21.6 2.3 51.9
27 2016-01-04 2016.0 1.0 Balochistan Quetta 3.0 1.0 0.0 17.1 1.2 48.0
28 2016-01-05 2016.0 1.0 Punjab Lahore 8.0 4.0 0.0 24.8 1.4 43.3
29 2016-01-05 2016.0 1.0 Punjab Rawalpindi 6.0 4.0 0.0 23.8 3.8 46.5
30 2016-01-05 2016.0 1.0 Punjab Multan 6.0 2.0 0.0 25.6 2.1 42.8
31 2016-01-05 2016.0 1.0 Sindh Karachi 11.0 5.0 0.0 27.7 1.3 47.0
32 2016-01-05 2016.0 1.0 Sindh Hyderabad 5.0 5.0 0.0 27.9 3.2 48.7
33 2016-01-05 2016.0 1.0 Khyber pakhtunkhwa Peshawar 3.0 0.0 0.0 22.5 1.4 39.8
34 2016-01-05 2016.0 1.0 Balochistan Quetta 5.0 2.0 0.0 13.6 4.3 40.9

4 — Exploratory Data Analysis

Distribution of Suspected Cases

In [18]:
plt.figure(figsize=(10,5))
sns.histplot(df['suspected_cases'], bins=8, kde=True, color='salmon')
plt.title('Distribution of Suspected Dengue Cases', fontsize=14)
plt.xlabel('Suspected Cases')
plt.ylabel('Frequency')
plt.show()
[Figure: Distribution of Suspected Dengue Cases]

Percentage of Deaths by Province

In [19]:
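# Aggregate deaths and suspected cases per province from the cleaned data
# instead of using hard-coded placeholder totals.
death_data = (
    df.groupby('province', as_index=False)[['deaths', 'suspected_cases']]
      .sum()
      .rename(columns={'province': 'Province', 'deaths': 'Deaths',
                       'suspected_cases': 'Suspected_Cases'})
)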
In [20]:
plt.figure(figsize=(8,8))
colors = sns.color_palette('Set2')
plt.pie(death_data['Deaths'], labels=death_data['Province'], autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Percentage of Deaths by Province', fontsize=14)
plt.axis('equal')
plt.show()
[Figure: Percentage of Deaths by Province]

Confirmed Cases vs Province

In [21]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x='province', y='confirmed_cases', hue='province', palette='Pastel1', dodge=False, legend=False)
plt.title('Confirmed Dengue Cases by Province', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Confirmed Cases')
plt.show()
[Figure: Confirmed Dengue Cases by Province (box plot)]

A_Monthly confirmed cases

In [22]:
df["year_month"] = df["date"].dt.to_period("M").astype(str)
monthly = df.groupby("year_month")["confirmed_cases"].sum().reset_index()

plt.figure(figsize=(12,4))
plt.plot(monthly["year_month"], monthly["confirmed_cases"], marker="o")
plt.xticks(monthly.index[::3], monthly["year_month"][::3], rotation=45)
plt.title("Monthly Confirmed Dengue Cases")
plt.xlabel("Year-Month")
plt.ylabel("Confirmed Cases")
plt.tight_layout()
plt.show()
[Figure: Monthly Confirmed Dengue Cases]

B_Monthly suspected and confirmed cases, one line per year

In [23]:
df['month'] = df['month'].astype(int)
df['year'] = df['year'].astype(int)

monthly_cases = df.groupby(['year','month'])[['suspected_cases','confirmed_cases']].sum().reset_index()

pivot_suspected = monthly_cases.pivot(index='month', columns='year', values='suspected_cases')
pivot_confirmed = monthly_cases.pivot(index='month', columns='year', values='confirmed_cases')

plt.figure(figsize=(15,8))

for year in pivot_suspected.columns:
    plt.plot(pivot_suspected.index,
             pivot_suspected[year],
             marker='o',
             label=f'Suspected {year}')

for year in pivot_confirmed.columns:
    plt.plot(pivot_confirmed.index,
             pivot_confirmed[year],
             marker='x',
             linestyle='--',
             label=f'Confirmed {year}')

plt.title("Monthly Trend Of Dengue Cases For All Years")
plt.xlabel("month")
plt.ylabel("Number of cases")
plt.xticks(range(1, 13))
plt.legend()
plt.grid(True)
plt.show() 
[Figure: Monthly Trend Of Dengue Cases For All Years]

C_Province-wise confirmed cases

In [24]:
province_wise = df.groupby("province")["confirmed_cases"].sum().reset_index()
province_wise = province_wise.sort_values(by="confirmed_cases", ascending=False)
plt.figure(figsize=(12,6))
plt.bar(province_wise["province"], province_wise["confirmed_cases"], color="black")
plt.xticks(rotation=45) 
plt.title("Confirmed Dengue Cases by Province")
plt.xlabel("Province")
plt.ylabel("Confirmed Cases")
plt.tight_layout() 
plt.show()
[Figure: Confirmed Dengue Cases by Province (bar chart)]

Correlation heatmap

In [25]:
corr = df[["suspected_cases","confirmed_cases","humidity","temperature","rainfall","deaths"]].corr()
plt.figure(figsize=(10,8))
sns.heatmap(
    corr, 
    annot=True, 
    fmt=".2f",   
    cmap="YlGnBu",      
    linewidths=0.5,      
    linecolor='white', 
    cbar_kws={"shrink":0.8},
    annot_kws={"size":10, "weight":"bold"}
)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.title("Correlation Between Numerical Features", fontsize=14, weight="bold")
plt.tight_layout()
plt.show()
[Figure: Correlation Between Numerical Features]
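A heatmap is easiest to read alongside a ranked list of how strongly each feature correlates with the target. The sketch below shows one way to produce that list from a correlation matrix; the `toy` frame is made up for illustration and its values are not taken from the dengue dataset.

```python
import pandas as pd

# Hypothetical numeric columns; confirmed_cases is exactly half of suspected_cases.
toy = pd.DataFrame({
    "suspected_cases": [2, 4, 6, 8, 10],
    "rainfall": [5, 3, 4, 1, 2],
    "confirmed_cases": [1, 2, 3, 4, 5],
})

# Correlation of every feature with the target, strongest (by magnitude) first.
corr_with_target = (
    toy.corr()["confirmed_cases"]
       .drop("confirmed_cases")
       .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Applied to the notebook, the same expression could be run on the `corr` frame computed in the cell above.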

5 — Regression Models (Train & Predict)

1_Define features and target

In [26]:
X = df[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']]
y_reg = df['confirmed_cases']

2_Split into training and test sets

In [27]:
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
    X, y_reg, test_size=0.3, random_state=42
)

3_Initialize and train the model

In [28]:
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train_lr, y_train_lr)
Out[28]:
LinearRegression()

4_Predict on test set

In [29]:
predictions_linear = linear_regression_model.predict(X_test_lr)

5_Evaluate

In [30]:
print("Linear Regression MSE:", mean_squared_error(y_test_lr, predictions_linear))
print("Linear Regression R²:", r2_score(y_test_lr, predictions_linear))
Linear Regression MSE: 1.1304187765144633
Linear Regression R²: 0.8280726113384165

Random Forest Regression

1 Split data (can use same X and y)

In [31]:
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(
    X, y_reg, test_size=0.3, random_state=42
)

2 Create and train Random Forest Regressor

In [32]:
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train_rf, y_train_rf)
Out[32]:
RandomForestRegressor(random_state=42)

3 Make predictions

In [33]:
predictions_random_forest = random_forest_model.predict(X_test_rf)

4 Evaluate the model

In [34]:
print("Random Forest Regression MSE:", mean_squared_error(y_test_rf, predictions_random_forest))
print("Random Forest Regression R²:", r2_score(y_test_rf, predictions_random_forest))
Random Forest Regression MSE: 1.1647791341145834
Random Forest Regression R²: 0.8228466838517181
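MSE is expressed in squared cases, which makes the two scores hard to interpret directly. Taking the square root gives the RMSE in the same unit as `confirmed_cases`. The short calculation below only reuses the two MSE values printed above.

```python
import math

mse_linear = 1.1304187765144633         # printed by cell 30
mse_random_forest = 1.1647791341145834  # printed by cell 34

rmse_linear = math.sqrt(mse_linear)
rmse_random_forest = math.sqrt(mse_random_forest)

print(f"Linear Regression RMSE: {rmse_linear:.3f}")         # about 1.063 cases
print(f"Random Forest RMSE:     {rmse_random_forest:.3f}")  # about 1.079 cases
```

Both models are off by roughly one confirmed case per record on the test split, and the default Random Forest does not improve on the linear fit here.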

6 — Linear Regression Predictions & Visualization

Hypothetical input values for 2026 (one row per province)

In [35]:
new_data_lr = pd.DataFrame({
    'province': ['Punjab', 'Sindh', 'Khyber Pakhtunkhwa', 'Balochistan'],
    'suspected_cases': [10, 20, 15, 8],
    'deaths': [1, 2, 1, 1],
    'temperature': [25, 30, 28, 32],
    'rainfall': [5, 10, 7, 4],
    'humidity': [50, 60, 55, 45],
    'year': [2026, 2026, 2026, 2026],
    'month': [6, 7, 8, 9]
})

Predict using the trained Linear Regression model and attach the predictions to the input

In [36]:
pred_linear = linear_regression_model.predict(new_data_lr.drop(columns='province'))
new_data_lr['Pred_Linear'] = pred_linear
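The prediction above relies on the columns that remain after dropping `province` being in the same order as during training. Selecting the feature columns by name is a slightly more defensive pattern. The sketch below illustrates it with a shortened, hypothetical feature list (`FEATURES`) and made-up training rows, so the numbers are not comparable to the notebook's model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical, shortened feature list used for both fitting and predicting.
FEATURES = ["suspected_cases", "temperature"]

# Made-up training rows where confirmed_cases is exactly half of suspected_cases.
train = pd.DataFrame({
    "suspected_cases": [2, 4, 6, 8],
    "temperature": [20, 25, 22, 30],
    "confirmed_cases": [1, 2, 3, 4],
})
model = LinearRegression().fit(train[FEATURES], train["confirmed_cases"])

# New rows arrive with an extra label column and a different column order.
new_rows = pd.DataFrame({
    "province": ["Punjab"],
    "temperature": [26],
    "suspected_cases": [10],
})

# Indexing with FEATURES restores the training order and drops the label column.
pred = model.predict(new_rows[FEATURES])
print(pred)
```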

Find the province with the highest expected cases

In [37]:
highest_linear_idx = new_data_lr['Pred_Linear'].idxmax()

Show the predictions alongside the inputs

In [38]:
print("Linear Regression Predictions for 2026:")
print(new_data_lr)
print("Highest expected cases by Linear Regression:")
print(new_data_lr.loc[highest_linear_idx])
Linear Regression Predictions for 2026:
             province  suspected_cases  deaths  temperature  rainfall  \
0              Punjab               10       1           25         5   
1               Sindh               20       2           30        10   
2  Khyber Pakhtunkhwa               15       1           28         7   
3         Balochistan                8       1           32         4   

   humidity  year  month  Pred_Linear  
0        50  2026      6     6.238485  
1        60  2026      7    12.988152  
2        55  2026      8     9.138210  
3        45  2026      9     5.274889  
Highest expected cases by Linear Regression:
province               Sindh
suspected_cases           20
deaths                     2
temperature               30
rainfall                  10
humidity                  60
year                    2026
month                      7
Pred_Linear        12.988152
Name: 1, dtype: object

VISUALIZATION

In [39]:
plt.figure(figsize=(8,6))
sns.barplot(data=new_data_lr, x='province', y='Pred_Linear', color='salmon')

plt.text(highest_linear_idx, new_data_lr.loc[highest_linear_idx, 'Pred_Linear'] + 0.5,
         f"Highest: {new_data_lr.loc[highest_linear_idx, 'Pred_Linear']:.1f}",
         color='red', ha='center', fontweight='bold')

plt.title('Linear Regression Predicted Cases for 2026', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Predicted Cases')
plt.show()
[Figure: Linear Regression Predicted Cases for 2026]

Random Forest Regression Predictions & Visualization

1 New data for 2026 (copy to keep separate)

In [40]:
new_data_rf = new_data_lr.copy()

2 Predict using Random Forest

In [41]:
pred_rf = random_forest_model.predict(new_data_rf[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']])
new_data_rf['Pred_RandomForest'] = pred_rf

3 Find highest predicted case

In [42]:
highest_rf_idx = new_data_rf['Pred_RandomForest'].idxmax()

4 Show predictions

In [43]:
print("Random Forest Regression Predictions for 2026:")
print(new_data_rf)
print("Highest expected cases by Random Forest Regression:")
print(new_data_rf.loc[highest_rf_idx])
Random Forest Regression Predictions for 2026:
             province  suspected_cases  deaths  temperature  rainfall  \
0              Punjab               10       1           25         5   
1               Sindh               20       2           30        10   
2  Khyber Pakhtunkhwa               15       1           28         7   
3         Balochistan                8       1           32         4   

   humidity  year  month  Pred_Linear  Pred_RandomForest  
0        50  2026      6     6.238485               5.66  
1        60  2026      7    12.988152              10.54  
2        55  2026      8     9.138210               7.96  
3        45  2026      9     5.274889               4.33  
Highest expected cases by Random Forest Regression:
province                 Sindh
suspected_cases             20
deaths                       2
temperature                 30
rainfall                    10
humidity                    60
year                      2026
month                        7
Pred_Linear          12.988152
Pred_RandomForest        10.54
Name: 1, dtype: object
In [44]:
# Visualization
In [45]:
plt.figure(figsize=(8,6))
# Plot by province so the chart matches the Linear Regression plot above
sns.barplot(data=new_data_rf, x='province', y='Pred_RandomForest', color='skyblue')

plt.text(highest_rf_idx, new_data_rf.loc[highest_rf_idx, 'Pred_RandomForest'] + 0.5,
         f"Highest: {new_data_rf.loc[highest_rf_idx, 'Pred_RandomForest']:.1f}",
         color='red', ha='center', fontweight='bold')

plt.title('Random Forest Predicted Cases for 2026', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Predicted Cases')
plt.show()
[Figure: Random Forest Predicted Cases for 2026]

7 — Classification Models (Train & Predict)

Step 1 — Prepare dataset for classification

Define classification target (High dengue cases = 1 if confirmed_cases > threshold)

In [46]:
threshold = 10
y_clf = (df['confirmed_cases'] > threshold).astype(int)
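With a threshold of 10 the positive class is rare. The arithmetic below sums the frequencies printed in cell 12 for values above 10 and derives the accuracy a model would reach by always predicting 0. It uses the full cleaned dataset rather than a test split, so treat the result as an approximate reference point when reading accuracy scores.

```python
# Frequencies of confirmed_cases values above 10, copied from the cell 12 output.
counts_above_threshold = {
    11: 19, 12: 12, 13: 7, 22: 5, 16: 4, 15: 4, 31: 4, 29: 4, 27: 4,
    17: 3, 28: 2, 20: 2, 19: 2, 14: 2, 47: 2,
}
# Values above 10 that occur exactly once:
# 24, 41, 40, 35, 44, 52, 33, 60, 38, 45, 26, 18, 56, 55, 36
single_occurrences = 15

total_rows = 10237  # shape printed after cleaning (cell 10)
positive_rows = sum(counts_above_threshold.values()) + single_occurrences
positive_share = positive_rows / total_rows
baseline_accuracy = 1 - positive_share  # accuracy of always predicting class 0

print("Positive rows:", positive_rows)
print(f"Positive share: {positive_share:.2%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```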

Features

In [47]:
X = df[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']]

Scale features

In [48]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Train-test split

In [49]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_clf, test_size=0.3, random_state=42
)
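In the cells above the scaler is fitted on all rows before the split, so statistics from the test rows influence the scaling. One common alternative is to fit the scaler on the training rows only, for example inside a pipeline. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up) and is not the method the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a minority positive class, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 1.0).astype(int)

# Split first, keeping the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then trains KNN.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)

fitted_scaler = knn_pipeline.named_steps["standardscaler"]
print("Test accuracy:", knn_pipeline.score(X_te, y_te))
```

Calling `knn_pipeline.predict` on new rows applies the same training-set scaling automatically, so no separate `scaler.transform` step is needed.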

Step 2 — KNN Classifier

Initialize and train KNN

In [50]:
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
Out[50]:
KNeighborsClassifier()

Predict on test set

In [51]:
pred_knn = knn_classifier.predict(X_test)

Accuracy

In [52]:
print("KNN Accuracy:", accuracy_score(y_test, pred_knn))
KNN Accuracy: 0.9967447916666666

Step 3 — Random Forest Classifier

Initialize and train Random Forest

In [53]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
Out[53]:
RandomForestClassifier(random_state=42)

Predict on test set

In [54]:
pred_rf = rf_classifier.predict(X_test)

Accuracy

In [55]:
print("Random Forest Accuracy:", accuracy_score(y_test, pred_rf))
Random Forest Accuracy: 0.9973958333333334
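Accuracy alone says little when one class dominates. `confusion_matrix` and `classification_report` are imported in cell 1 but never used. The sketch below, on made-up labels (not the project's predictions), shows how a classifier that never predicts the positive class still scores high accuracy while its recall is zero.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 95 negatives, 5 positives, and a model that always predicts 0.
y_true_demo = [0] * 95 + [1] * 5
y_pred_demo = [0] * 100

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_recall = recall_score(y_true_demo, y_pred_demo, zero_division=0)
demo_matrix = confusion_matrix(y_true_demo, y_pred_demo)

print("Accuracy:", demo_accuracy)
print("Recall for class 1:", demo_recall)
print(demo_matrix)  # rows are true classes, columns are predicted classes
```

In the notebook, the same calls would take `y_test` together with `pred_knn` or `pred_rf`.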

Step 4 — Predict new cases for 2025

New data for 2025 (example values for each province)

In [56]:
new_data_2025 = pd.DataFrame({
    'suspected_cases': [12, 18, 15, 10],  
    'deaths': [1, 2, 1, 1],
    'temperature': [26, 30, 28, 25],
    'rainfall': [6, 12, 8, 4],
    'humidity': [55, 65, 60, 50],
    'year': [2025, 2025, 2025, 2025],
    'month': [7, 7, 7, 7],
    'province': ['Punjab', 'Sindh', 'Khyber Pakhtunkhwa', 'Balochistan']
})

Scale features

In [57]:
X_new_scaled = scaler.transform(new_data_2025.drop(columns=['province']))

Predictions

In [58]:
new_data_2025['Pred_KNN'] = knn_classifier.predict(X_new_scaled)
new_data_2025['Pred_RF'] = rf_classifier.predict(X_new_scaled)

print(new_data_2025)
   suspected_cases  deaths  temperature  rainfall  humidity  year  month  \
0               12       1           26         6        55  2025      7   
1               18       2           30        12        65  2025      7   
2               15       1           28         8        60  2025      7   
3               10       1           25         4        50  2025      7   

             province  Pred_KNN  Pred_RF  
0              Punjab         0        0  
1               Sindh         1        1  
2  Khyber Pakhtunkhwa         0        0  
3         Balochistan         0        0  

Step 5 — Visualization

Prepare data for plotting KNN predictions

In [59]:
plot_knn = new_data_2025[['province', 'Pred_KNN']]
plt.figure(figsize=(8,6))
sns.barplot(data=plot_knn, x='province', y='Pred_KNN', color='skyblue')
plt.title('Predicted High Dengue Cases (KNN) for 2025 by Province', fontsize=14)
plt.ylabel('High Cases (1=Yes, 0=No)')
plt.xlabel('Province')
plt.ylim(0, 1.5)
plt.show()
[Figure: Predicted High Dengue Cases (KNN) for 2025 by Province]

Prepare data for plotting Random Forest predictions

In [60]:
plot_rf = new_data_2025[['province', 'Pred_RF']]
plt.figure(figsize=(8,6))
sns.barplot(data=plot_rf, x='province', y='Pred_RF', color='lightgreen')
plt.title('Predicted High Dengue Cases (Random Forest) for 2025 by Province', fontsize=14)
plt.ylabel('High Cases (1=Yes, 0=No)')
plt.xlabel('Province')
plt.ylim(0, 1.5)
plt.show()
[Figure: Predicted High Dengue Cases (Random Forest) for 2025 by Province]

8_Interactive Plots

In [61]:
import plotly.express as px

Default template

In [62]:
DEFAULT_TEMPLATE = "plotly_dark"

1 Scatter plot of temperature against confirmed dengue cases, with humidity as the colour scale

In [63]:
print("--- Task 01: Dengue Scatter Plot Description ---")
fig1 = px.scatter(
    df,
    x="temperature",
    y="confirmed_cases",
    color="humidity",
    hover_data=df.columns,
    title="Task 01: Temperature vs Dengue Confirmed Cases"
)
fig1.update_layout(template=DEFAULT_TEMPLATE)
fig1.show()
--- Task 01: Dengue Scatter Plot Description ---

2 3D scatter plot of temperature, humidity, and rainfall, coloured by confirmed dengue cases

In [99]:
print("--- Task 02: Dengue 3D Scatter Plot Description ---")
fig2 = px.scatter_3d(
    df,
    x="temperature",
    y="humidity",
    z="rainfall",
    color="confirmed_cases",
    hover_data=df.columns,
    title="Task 02: Dengue Environmental Factors in 3D"
)
fig2.update_layout(template=DEFAULT_TEMPLATE)
fig2.show()
--- Task 02: Dengue 3D Scatter Plot Description ---

Explanation

The 3D scatter plot visualises the relationship between dengue cases and environmental factors like temperature, humidity, and rainfall. Each point represents a data record, with its position determined by these three factors, and the colour indicating the number of confirmed dengue cases. By hovering over a point, all details for that record can be seen. The plot helps identify patterns, showing that higher numbers of dengue cases tend to occur when temperature, humidity, and rainfall are all elevated, highlighting how these meteorological conditions contribute to outbreaks.

3 Bubble plot of confirmed cases by province and district

In [64]:
print("--- Task 03: Dengue Bubble Plot Description ---")
fig3 = px.scatter(
    df,
    x="province",
    y="district",
    size="confirmed_cases",
    color="rainfall",
    hover_data=df.columns,
    title="Task 03: Dengue Cases by Province and District"
)
fig3.update_layout(template=DEFAULT_TEMPLATE)
fig3.show()
--- Task 03: Dengue Bubble Plot Description ---

Explanation

The bubble plot shows the distribution of dengue cases across provinces and districts. Because the data is not aggregated before plotting, each bubble is one record, positioned by province on the X-axis and district on the Y-axis, so all records for a district are drawn on top of each other at the same point. Bubble size reflects the number of confirmed cases in that record, and colour represents rainfall, allowing a visual comparison of rainfall levels and case counts. Hovering over a bubble displays all the data for that record. Overall, the plot helps identify which districts and provinces are most affected and how rainfall might relate to dengue spread.
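Since the plot above is drawn from individual records, one possible refinement is to aggregate to one row per district first, so that each district gets a single bubble. The sketch below shows only the aggregation step on a few made-up rows; `records`, `district_totals`, and `mean_rainfall` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up records: Lahore appears twice, as it would across different dates.
records = pd.DataFrame({
    "province": ["Punjab", "Punjab", "Punjab", "Sindh"],
    "district": ["Lahore", "Lahore", "Multan", "Karachi"],
    "confirmed_cases": [1, 4, 3, 5],
    "rainfall": [1.4, 4.9, 2.7, 0.1],
})

# One row per district: total confirmed cases and average rainfall.
district_totals = records.groupby(["province", "district"], as_index=False).agg(
    confirmed_cases=("confirmed_cases", "sum"),
    mean_rainfall=("rainfall", "mean"),
)
print(district_totals)
```

Passing `district_totals` to `px.scatter` with `size="confirmed_cases"` and `color="mean_rainfall"` would then draw one bubble per district.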

4 Histogram showing how confirmed dengue cases are distributed across months

In [65]:
print("--- Task 04: Dengue Histogram Description ---")
fig4 = px.histogram(
    df,
    x="confirmed_cases",
    color="month",
    marginal="box",
    hover_data=df.columns,
    title="Task 04: Monthly Distribution of Dengue Confirmed Cases"
)
fig4.update_layout(template=DEFAULT_TEMPLATE)
fig4.show()
--- Task 04: Dengue Histogram Description ---

Explanation

The histogram shows the distribution of confirmed dengue cases across different months. The X-axis represents the number of confirmed cases, while the bars are coloured by month, allowing you to see which months have higher or lower case counts. The box plot on the margin provides a summary of the overall distribution, showing the median, quartiles, and potential outliers. Hovering over each bar displays detailed data for that record. This plot helps identify seasonal trends, highlighting months with the highest dengue activity and showing how case numbers vary over time.

5 Treemap visualising dengue cases hierarchically

In [66]:
print("--- Task 05: Dengue Treemap Description ---")

df_tree = df.copy()

df_tree["confirmed_cases"] = df_tree["confirmed_cases"].replace(0, 1)

fig5 = px.treemap(
    df_tree,
    path=["province", "district"],
    values="confirmed_cases",
    color="temperature",
    hover_data=df.columns,
    title="Task 05: Dengue Cases Treemap by Province and District"
)
fig5.update_layout(template=DEFAULT_TEMPLATE)
fig5.show()
--- Task 05: Dengue Treemap Description ---

Explanation

The treemap visualises dengue cases by province and district, showing the relative size of outbreaks across regions. Each rectangle represents a district, nested within its province, and the size of the rectangle corresponds to the number of confirmed dengue cases, so larger rectangles indicate more cases. The colour represents temperature, allowing you to see how higher or lower temperatures relate to case numbers. Hovering over a rectangle shows all the data for that district. This plot helps quickly identify the provinces and districts most affected by dengue and highlights possible links between temperature and outbreak intensity.

6 Time series line chart showing how confirmed dengue cases change over time

In [67]:
print("--- Task 06: Dengue Time Series Description ---")
fig6 = px.line(
    df,
    x="date",
    y="confirmed_cases",
    color="province",
    hover_data=df.columns,
    title="Task 06: Time Series of Dengue Confirmed Cases"
)
fig6.update_layout(template=DEFAULT_TEMPLATE)
fig6.show()

print("\nAll dengue visualisations generated successfully.")
--- Task 06: Dengue Time Series Description ---
All dengue visualisations generated successfully.

Explanation

The time series line plot shows how confirmed dengue cases change over time across different provinces. The X-axis represents the date, while the Y-axis shows the number of confirmed cases. Each line corresponds to a province, allowing comparison of trends between regions. Hovering over a point displays detailed information for that date and province. This plot helps track outbreaks, identify peaks in cases, and observe seasonal patterns, making it easier to understand when and where dengue cases rise or fall over time.
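In the raw data each province has several district rows per date (for example Lahore, Rawalpindi, and Multan in Punjab), so a line drawn from unaggregated rows connects several values on the same date. A minimal sketch of summing cases per province and month before plotting follows; the `records` frame is made up and `monthly_by_province` is an illustrative name.

```python
import pandas as pd

# Made-up records: two Punjab districts share a date, plus later entries.
records = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-01", "2016-01-01", "2016-01-20", "2016-02-03", "2016-01-01"]
    ),
    "province": ["Punjab", "Punjab", "Punjab", "Punjab", "Sindh"],
    "confirmed_cases": [1, 0, 4, 2, 3],
})

# Sum confirmed cases per province and calendar month ("MS" = month start).
monthly_by_province = (
    records.groupby(["province", pd.Grouper(key="date", freq="MS")])["confirmed_cases"]
           .sum()
           .reset_index()
)
print(monthly_by_province)
```

The aggregated frame has one point per province and month, which `px.line` can draw as one clean line per province.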

9_Pakistan Maps

In [68]:
from IPython.display import Image, display

display(Image(filename="PAK1.png"))  
print("PAK1") 


display(Image(filename="PAK.png"))  
print("PAK")  
[Figure: PAK1.png]
PAK1
[Figure: PAK.png]
PAK

SUMMARY OF LAB PROJECT

This lab project provides a comprehensive analysis of dengue cases in Pakistan using a dataset spanning multiple years and provinces. The dataset was first cleaned and preprocessed by fixing date formats, converting numeric columns, handling missing values, and standardising text fields. Exploratory data analysis revealed the distribution of suspected and confirmed cases, the percentage of deaths by province, monthly and yearly trends, and correlations between environmental factors like temperature, humidity, and rainfall with dengue incidence. Regression models, including Linear Regression and Random Forest Regressor, were trained to predict confirmed cases, and predictions for 2026 were generated under hypothetical environmental conditions. Classification models, KNN and Random Forest Classifier, were implemented to identify high-risk dengue cases, with predictions for 2025 across provinces. Visualisations included histograms, boxplots, line charts, 3D scatter plots, bubble plots, treemaps, and interactive scatter plots to illustrate trends, distributions, and relationships between variables. Finally, a choropleth map of Pakistan displayed dengue cases by province, integrating geographical context. Overall, the project highlights the role of environmental factors in dengue outbreaks, identifies regions with higher risk, and demonstrates predictive modelling and data visualisation techniques for public health insights.
